1.  Introduction

The major issues involved in modeling modern computer systems can be broadly
classified into those arising from model construction, from model reduction and
solution, and from the interpretation of the model solution. Modeling languages such
as fault trees, the PMS notation,2 and Extended Stochastic Petri Nets3 can be valuable
in simplifying the task of model construction. The goal of the languages is to provide
well defined constructs to the user and let the modeling package automatically generate
the details of the underlying stochastic model. The language constructs should
correspond closely to the system constructs, and yet should produce a concise
representation.4

Specifying the relevant details of the system being modeled can require a tremendous
number of states to be considered (in excess of 100,000). Techniques must be
developed to reduce the model to one that is computationally tractable, and then to
solve the reduced model in a computationally efficient manner. Once the solution is
obtained, it must be interpreted carefully. The errors introduced by the model reduction
step and in the solution must be bounded, and the sensitivity of the solution with
respect to input parameters should be estimated.

We  have  made  considerable  progress  under  the  auspices  of  this  grant  in  both  model 
construction  techniques  and  model  reduction  and  solution  techniques.  This  progress  will 
be  outlined  in  the  next  two  sections. 


2.  Model  Construction 

Three sets of inputs are necessary to construct a reliability model: the system
structure, the fault-occurrence behavior, and the fault and error handling behavior.
The description of the system structure (the set of resources, their interconnections and
the conditions under which the system is operational) and the fault-occurrence behavior
determine the structure of the dependability model.

Fault trees are often used to specify the conditions under which a system fails, and
by implication, the set of resources and their interconnections. A fault tree is a logical
diagram that describes the various combinations of events that lead to the undesirable
top event, system failure. The top event is divided into its constituent events
(subsystem failures), which are then similarly subdivided. The lower level events are
connected to the higher level events by means of Boolean logic gates. The lowest level
events are called basic events, and usually correspond to the failure of components.
Reliability block diagrams are similar to fault trees in that they are simple to
understand and construct, but where a fault tree is a 'failure' diagram, the reliability
block diagram is a 'success' diagram.2 Each component or subsystem is represented by a
block; the logical dependencies are represented by connections between the blocks. Each
path between the ends represents a configuration that leaves the system operational. One
major drawback to both fault trees and reliability block diagrams is that they are
'static' diagrams; they are not designed to model dynamically reconfigurable systems,
for example.
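
As a small illustration of how a fault tree combines basic events through Boolean
gates, the following sketch evaluates the probability of the top event. It assumes
statistically independent basic events and no basic event shared between subtrees;
the tree, the event names and the probabilities are hypothetical and are not taken
from any system discussed in this report.

```python
from math import prod

def gate_and(probs):
    """Output event occurs only if all (independent) input events occur."""
    return prod(probs)

def gate_or(probs):
    """Output event occurs if at least one (independent) input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)

def evaluate(node, basic):
    """Recursively evaluate a fault tree.

    A node is either the name of a basic event (a string) or a pair
    (gate_type, children) with gate_type "AND" or "OR".
    """
    if isinstance(node, str):
        return basic[node]
    kind, children = node
    probs = [evaluate(child, basic) for child in children]
    return gate_and(probs) if kind == "AND" else gate_or(probs)

# Hypothetical system: it fails if the bus fails OR both processors fail.
tree = ("OR", ["bus", ("AND", ["cpu1", "cpu2"])])
basic = {"bus": 0.01, "cpu1": 0.1, "cpu2": 0.1}

p_fail = evaluate(tree, basic)   # probability of the top event (system failure)
p_success = 1.0 - p_fail         # what the equivalent 'success' diagram would give
```

The same tree read as a reliability block diagram is the bus in series with the two
processors in parallel, which is why the complement of the top-event probability
equals the block-diagram reliability in this example.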

More general system structure characteristics can be modeled with state transition
diagrams. In this framework, every possible state of the system must be enumerated
and classified, as well as the transitions between the states. If the transition rates are
constant with time, then the resulting state transition diagram is a Markov chain.5 The
constant transition rates imply that the time spent in each state is exponentially
distributed. If the transitions between the states depend on the time spent in the
individual state, then the resulting chain may be semi-Markovian.6 Semi-Markov processes
allow the time spent in each state to be generally distributed; this generality makes the
solution of all but the smallest models difficult. If the distributions of the holding
time in each state are limited to exponential polynomials (a very minor restriction),
the model can be solved much more easily.7,8

A PMS (processor-memory-switch) diagram is a higher level description of the
structure of the system; it shows more explicitly the components and their physical
interconnections. A PMS diagram is often accompanied by a set of 'assertions,' a listing
of the requirements that must be fulfilled for the system to be operational. The PMS
diagram must be 'translated' into another form before the system can be analyzed.2

Performance  analysts  may  prefer  to  represent  the  system  in  terms  of  a  queueing 
network9  with  two  service  centers,  one  corresponding  to  the  failure  process  and  the  other 
corresponding  to  the  repair  process.  The  major  advantage  of  this  approach  is  that  the 
performance  analyst  can  form  the  model  in  a  familiar  language.  Also,  a  great  deal  of 
study  has  been  performed  on  queueing  networks. 

Yet another powerful tool for describing the system structure is the extended
stochastic Petri net (ESPN).3 The ESPN is especially useful for modeling systems that
exhibit asynchronous concurrent activities, and is more general than the other languages
mentioned.10 The major drawback to using ESPNs is that it may be difficult for the
analyst unfamiliar with the intricacies of ESPNs to develop a correct representation of
the system. However, its generality allows us to develop model reduction and solution
methods for the ESPN, with the knowledge that the techniques are applicable to the
other model types. Further, it permits us to study the relationships between the
different model types. An ESPN model can serve as an 'intermediate' language for
comparison of the versatility and ease of specification of the other model types. A
comparison of the different modeling languages is necessary to objectively determine the
pros and cons of each one, and to investigate their ranges of applicability. This may
allow us to define new constructs for the other languages, to increase their modeling
power and applicability. Then each analyst may continue to operate within a familiar and
comfortable environment, without sacrificing versatility or speed.

The last set of inputs required for a dependability model pertains to the behavior of
the system upon the occurrence of a fault. This part of the model, called the
fault/error-handling model (also called a coverage model), may include such behavior as
fault and error detection, transient recovery and automatic reconfiguration. Several
different models have been developed to represent this behavior, most of which are
included in the HARP reliability prediction package.11,12 A major difference between
modeling system structure and fault occurrence and modeling fault/error handling is
that specific languages have been developed for the former, while specific models have
been developed for the latter. The concept of modeling coverage13,14 is a fairly recent
one whose importance is just beginning to be appreciated.

Under the current Air Force grant, we have made important advances in defining
modeling languages, such as the Extended Stochastic Petri Net (ESPN),15 as discussed in
this section. A new system is under development that allows the hierarchical definition
of models. Each subsystem can be specified and combined by using fault trees,
reliability block diagrams, Markov chains, semi-Markov processes and/or stochastic
precedence graphs.


3.  Model  Reduction  and  Solution 

Once the complete description of the system being modeled is generated, and the
fault/error handling behavior is described, the resulting model is often too large and
complex to solve. Another problem that will frequently arise is stiffness: competing
events whose time constants differ by many orders of magnitude. Stiffness causes
difficulties in both numerical and simulative solutions. We can often reduce the model
to one that is more tractable by exploiting the characteristic that makes the model stiff.
Informally, we can decompose the model into two submodels, one that represents the
'fast' behavior and the other the 'slow.' These two models may be solved separately,
and their solutions aggregated into the overall model solution.

One such model reduction technique that is often used divides the model into
distinct fault-occurrence and fault/error-handling models. This technique, termed
behavioral decomposition, has been utilized in CARE II,16 CARE III17 and HARP.18,19
The fault/error-handling model is solved in (semi-)isolation for coverage factors, which
are then combined with the system structure and fault arrival information for solution of
the overall model.
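
To illustrate how a coverage factor enters the aggregated model, the sketch below uses
a textbook duplex system rather than any of the packages named above. The assumptions
are ours: two identical units, each failing at constant rate lam; a first failure is
handled successfully with probability c (the coverage factor, taken here as a given
number instead of being computed from a fault/error-handling model); an uncovered
first failure, or a second failure, brings the system down.

```python
from math import exp

def duplex_reliability(lam, c, t):
    """Reliability at time t of a two-unit system with coverage factor c.

    Aggregated Markov model (an assumption of this sketch):
      state 2 --(2*lam*c)-------> state 1   covered first failure
      state 2 --(2*lam*(1 - c))-> failed    uncovered first failure
      state 1 --(lam)-----------> failed    second failure
    Solving the model gives
      P2(t) = exp(-2*lam*t)
      P1(t) = 2*c*(exp(-lam*t) - exp(-2*lam*t))
    and the reliability is P2(t) + P1(t).
    """
    p2 = exp(-2.0 * lam * t)
    p1 = 2.0 * c * (exp(-lam * t) - exp(-2.0 * lam * t))
    return p2 + p1

# Hypothetical numbers: the effect of imperfect coverage on a 10-hour mission.
for c in (1.0, 0.99, 0.9):
    print(c, duplex_reliability(1e-4, c, 10.0))
```

With c = 1 the expression reduces to the usual parallel-system reliability, and with
c = 0 it reduces to that of a series system, which shows how strongly the coverage
factor can influence the prediction.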

Another technique for the reduction of the overall model has been presented by
Bobbio and Trivedi.20 They present an approximation algorithm for systematically
converting a stiff Markov chain into a non-stiff chain with a smaller state space. This
method works on the matrix representation of the Markov chain, rather than
interpreting the underlying behavior of the system being modeled. The problem of
model reduction needs to be studied and extended to other model types, and the
applicability of the various techniques must be investigated. We are continuing a
serious study of decomposition/aggregation methods applicable to large, stiff Markov
reliability and availability models.

When the model is reduced to an acceptable size, the most appropriate solution
technique must be chosen. An analytic solution is desirable since it is often the fastest
and the most efficient. A combinatorial solution is often used when the input is specified
in terms of a fault tree or reliability block diagram. To predict the reliability of a
system at some time t, this solution method considers the combinations of events that
cause the system to fail (or remain operational) and assigns a probability to each
combination. A more general combinatorial method has been implemented in SPADE7 in
which t remains symbolic. The system does not need to be re-solved for each value of t
for which the solution is desired. Also, the times of interest may be generally distributed
(exponential polynomials). The SPADE solution method is applicable to fault trees and
reliability block diagrams. In fact, it is applicable to any system that can be specified as
a directed acyclic graph.
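
The idea of keeping t symbolic can be illustrated with a minimal sketch. It is not the
SPADE implementation; the data layout and function names are our own assumptions. An
exponential polynomial, a sum of terms a * t**k * exp(-b*t), is stored as a dictionary
mapping (k, b) to a. Products and complements of such functions are again exponential
polynomials, so a series-parallel structure can be combined once and then evaluated at
any number of time points.

```python
from collections import defaultdict
from math import exp

def multiply(f, g):
    """Product of two exponential polynomials stored as {(k, b): a}."""
    out = defaultdict(float)
    for (k1, b1), a1 in f.items():
        for (k2, b2), a2 in g.items():
            out[(k1 + k2, b1 + b2)] += a1 * a2
    return dict(out)

def complement(f):
    """Return the exponential polynomial 1 - f."""
    out = defaultdict(float, {key: -a for key, a in f.items()})
    out[(0, 0.0)] += 1.0
    return dict(out)

def evaluate(f, t):
    """Numerical value of the exponential polynomial at time t."""
    return sum(a * t**k * exp(-b * t) for (k, b), a in f.items())

def exp_cdf(rate):
    """Failure-time CDF 1 - exp(-rate*t) of an exponentially distributed lifetime."""
    return {(0, 0.0): 1.0, (0, rate): -1.0}

# Hypothetical structure: two units in parallel (both must fail) ...
F_parallel = multiply(exp_cdf(0.001), exp_cdf(0.002))
# ... in series with a third unit (the system fails when either part fails).
F_system = complement(multiply(complement(F_parallel), complement(exp_cdf(0.0005))))

# The symbolic result is built once and evaluated for several values of t.
reliability = [1.0 - evaluate(F_system, t) for t in (10.0, 100.0, 1000.0)]
```

The product corresponds to an AND combination of independent failure events and the
complement of a product of complements to an OR combination, so the same three
operations suffice for any fault tree or block diagram without repeated components.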

Recently, we have developed a general model that allows subsystems to be specified
as fault trees, reliability block diagrams, stochastic precedence graphs and/or
semi-Markov processes. The solution method developed earlier for SPADE extends to such a
hybrid model and combines the efficiency of combinatorial approaches and the versatility
of a Markovian approach.

A Markov chain produces a set of ordinary differential equations

    P'(t) = P(t) A(t),    P(0) = P_0

where P(t) is the probability vector for operational states and A(t) is the associated
matrix of (possibly) time dependent transition rates. This analytic model is then solved
numerically for the state probabilities P_i(t). The reliability or availability of the
system is then given by the sum of state probabilities for operational states. In a
reliability model the failure states are absorbing, while for an availability model,
repair can cause a transition from a failure state to an operational state. We have
generally used a Runge-Kutta-Fehlberg type quadrature routine to solve the set of
equations associated with a Markov chain, but have recently begun serious study of
numerical methods more suitable for the specific kinds of matrices associated with
stochastic systems.
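
The following sketch integrates the equations P'(t) = P(t) A for a small example. It
uses the classical fixed-step fourth-order Runge-Kutta method, which is simpler than
the adaptive Runge-Kutta-Fehlberg routine mentioned above, and it assumes constant
(time independent) rates. The three-state model (two units up, one unit up, failed)
and the rate value are hypothetical; the absorbing failure state is kept in the
matrix so that each row of A sums to zero.

```python
def vec_mat(p, a):
    """Row vector times matrix: (p A)_j = sum_i p_i * a_ij."""
    return [sum(p[i] * a[i][j] for i in range(len(p))) for j in range(len(a[0]))]

def rk4_step(p, a, h):
    """One classical Runge-Kutta step of size h for P' = P A."""
    k1 = vec_mat(p, a)
    k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], a)
    k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], a)
    k4 = vec_mat([x + h * k for x, k in zip(p, k3)], a)
    return [x + h / 6.0 * (d1 + 2.0 * d2 + 2.0 * d3 + d4)
            for x, d1, d2, d3, d4 in zip(p, k1, k2, k3, k4)]

def transient(p0, a, t, steps=1000):
    """State probabilities at time t, starting from the initial vector p0."""
    p, h = list(p0), t / steps
    for _ in range(steps):
        p = rk4_step(p, a, h)
    return p

lam = 0.01                        # hypothetical failure rate of each unit
A = [[-2 * lam, 2 * lam, 0.0],    # both units up
     [0.0,      -lam,    lam],    # one unit up
     [0.0,      0.0,     0.0]]    # system failed (absorbing)

p = transient([1.0, 0.0, 0.0], A, t=50.0)
reliability = p[0] + p[1]         # sum over the operational states
```

For this model the result can be checked against the closed form
2 exp(-lam t) - exp(-2 lam t). A fixed step is adequate here because the model is not
stiff; for stiff models the step size that an explicit method needs for stability
becomes very small, which is one motivation for studying other numerical methods.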

In order to analytically combine the study of performance and
reliability/availability, a Markov reward process is often used. In these models, a reward
(relating the performance of the system to the structure) is associated with each state.
Kulkarni, Nicola and Trivedi have proposed a unified model that relates performance
and reliability measures for the analysis of fault-tolerant systems.21 The solution of the
reward process is given in terms of double transforms (one for the time variable, and the
second for the reward variable). Since analytical inversion is not tractable, they resort
to a hybrid analytical-numerical approach for the inversion of the double transforms.
The computational procedure involves the numerical evaluation of the roots of a
polynomial followed by an analytic inversion with respect to the Laplace transform
variable. The Laplace-Stieltjes inversion is then carried out numerically.22

Often a simulative solution is preferable to an analytic one, especially if the model
includes concurrency (and non-exponential distributions). If the model is phrased in
terms of an ESPN, it can be simulated using DEEP (the Duke ESPN Evaluation
Package).10,15 DEEP provides either a transient (for reliability analysis) or steady state
(for availability or performance analysis) solution. The major advantage to simulating a
system for solution is the flexibility that is possible. The major disadvantage to
simulation arises when trying to solve a stiff system. Many simulation trials are needed
if the model includes very rare events.

We are investigating techniques for more efficient simulation of stiff systems, in
which the occurrence (or non-occurrence) of rare events is recognized. The rare events
can then be forced to occur in the simulation. The statistical analysis of the simulation
runs would then "weight" the results accordingly. Developing techniques for the ESPN
model assures us that the techniques are applicable to the other model types, since they
are all special cases of an ESPN model.
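
A minimal sketch of forcing with weights is given below. It is one standard form of
importance sampling and is offered only as an illustration, not as the technique under
investigation. The model is a hypothetical two-unit system with per-unit failure rate
lam, coverage factor c and mission time T. Each failure time is sampled conditional on
occurring before the end of the mission, and the trial is weighted by the probability
that the failure would in fact have occurred in that interval.

```python
import random
from math import exp, expm1, log1p

def forced_time(rate, horizon, rng):
    """Sample an exponential time conditioned on falling before `horizon`.

    Returns (time, weight); weight = 1 - exp(-rate*horizon) is the probability
    of the forced event under the original, unforced model.
    """
    weight = -expm1(-rate * horizon)
    u = rng.random()
    return -log1p(-u * weight) / rate, weight

def unreliability_forced(lam, c, T, trials, seed=1):
    """Estimate P(system fails before T) with every trial forced to fail."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The first failure of either unit (rate 2*lam) is forced into [0, T).
        t1, w = forced_time(2.0 * lam, T, rng)
        if rng.random() < c:
            # Covered: the system continues on one unit, so the second
            # failure is forced into the remaining mission time as well.
            remaining = max(0.0, T - t1)
            w *= -expm1(-lam * remaining)
        # Uncovered first failures bring the system down immediately.
        total += w        # the weight corrects for the forcing
    return total / trials

lam, c, T = 1e-4, 0.99, 10.0      # hypothetical parameters
estimate = unreliability_forced(lam, c, T, trials=200_000)
exact = 1.0 - (2 * c * exp(-lam * T) + (1 - 2 * c) * exp(-2 * lam * T))
```

With these parameters the unreliability is of the order of 2e-5, so an unforced
simulation would observe a system failure only about once in 50,000 trials, whereas
every forced trial contributes a weighted observation.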

In many cases neither a simulation model nor an analytic model is sufficient to
include all the system aspects in one model. In this case a hybrid model, a judicious
combination of simulation and analytic models, may be used. HARP is such a hybrid
model, since the fault handling model might be simulated, but the aggregated Markov
model is solved analytically (numerically). The interface between the hybrid parts of
such a model must be designed carefully.

Thus we have made major advances in solving complex performability models23
and in deriving a hybrid combinatorial-Markov model for solving complex realistic
models.


4.  Interpretation  of  the  Solution 

There are errors involved in any type of modeling for system evaluation. It is
necessary to identify all assumptions and sources of error in model prediction, and to
define proof procedures to verify, or experiments to validate, the assumptions. If the
assumptions are not supported by these procedures, either the model needs to be
modified or the errors in model predictions need to be bounded.

Often the values of the transition rates are known to lie within a certain range of
values, with a very high probability. Also, there may be a positive (although very small)
probability that the initial state of the system does not correspond to the initial state of
the model. In these cases we are interested in the range of values within which the
reliability lies, rather than in a point estimate. Smotherman has devised a technique24
for converting a complex reliability model to a much simpler model. This simple model
can then be used to bound the final result with respect to the parametric sensitivity.

In addition to point estimates of such metrics as availability and mean time to
failure, SAVE9 produces an estimate of the sensitivity of the estimate to various input
parameters. The user can then have an idea as to which system parameters are most
crucial to the operation of the system. Parametric sensitivity measures can also be
useful when optimizing a system with respect to reliability, performance, cost, etc.
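
A parametric sensitivity can be approximated numerically when no closed form is
available. The sketch below is a generic central finite difference and does not
describe how SAVE computes its estimates. The model is the simplest availability
model, a single repairable unit with failure rate lam and repair rate mu, whose steady
state availability is mu / (lam + mu); the parameter values are hypothetical.

```python
def availability(lam, mu):
    """Steady state availability of one repairable unit."""
    return mu / (lam + mu)

def sensitivity(f, params, name, rel_step=1e-6):
    """Central finite-difference estimate of the derivative of f
    with respect to the parameter params[name]."""
    h = params[name] * rel_step
    up = dict(params, **{name: params[name] + h})
    down = dict(params, **{name: params[name] - h})
    return (f(**up) - f(**down)) / (2.0 * h)

params = {"lam": 0.001, "mu": 0.1}
d_lam = sensitivity(availability, params, "lam")   # effect of the failure rate
d_mu = sensitivity(availability, params, "mu")     # effect of the repair rate
```

For this model the derivatives are known exactly, -mu / (lam + mu)**2 and
lam / (lam + mu)**2, so the estimates can be checked. Multiplying each derivative by
its parameter value gives scaled sensitivities that can be compared across parameters
with different units.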

We  are  continuing  a  study  of  error  bounding  techniques  and  sensitivity  analysis  for 
complex  reliability  and  availability  models. 